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ABSTRACT 

In this paper, we present an algorithm for rapid Bayesian analysis that combines the benefits of 
nested sampling and artificial neural networks. The blind accelerated multimodal Bayesian in- 
ference (BAMBI) algorithm implements the MultiNest package for nested sampling as well 
as the training of an artificial neural network (NN) to learn the likelihood function. In the case 
of computationally expensive likelihoods, this allows the substitution of a much more rapid 
approximation in order to increase significantly the speed of the analysis. We begin by demon- 
strating, with a few toy examples, the ability of a NN to learn complicated likelihood surfaces. 
BAMBI's ability to decrease running time for Bayesian inference is then demonstrated in the 
context of estimating cosmological parameters from Wilkinson Microwave Anisotropy Probe 
and other observations. We show that valuable speed increases are achieved in addition to 
obtaining NNs trained on the likelihood functions for the different model and data combi- 
nations. These NNs can then be used for an even faster follow-up analysis using the same 
likelihood and different priors. This is a fully general algorithm that can be applied, without 
any pre-processing, to other problems with computationally expensive likelihood functions. 
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1 INTRODUCTION 

Bayesian methods of inference are widely used in astronomy and 
cosmology and are gaining popularity in other fields, such as par- 
ticle physics. They can generally be divided into the performance 
of two main tasks: parameter estimation and model selection. The 
former has traditionally been performed using Markov chain Monte 
Carlo (MCMC) methods, usually based on the Metropolis-Hastings 
algorithm or one of its variants. These can be computationally ex- 
pensive in their exploration of the parameter space and often need 
to be finely tuned in order to produce accurate results. Additionally, 
sampling efficiency can be seriously affected by multimodal distri- 
butions and large, curving degeneracies. The second task of model 
selection is further hampered by the need to calculate the Bayesian 
evidence accurately. The most common method of doing so is ther- 
modynamic integration, which requires several chains to be run, 
thus multiplying the computational expense. Fast methods of ev- 
idence calculation, such as assuming a Gaussian peak, clearly fail 
in multimodal and degenerate situations. Nested sampling ( |Skilling| 
[2004 1 is a method of Monte Carlo sampling designed for efficient 
calculation of the evidence which also provides samples from the 
posterior distribution as a by-product, thus allowing parameter es- 
timation at no additional cost. The MultiNest algorithm ( [Feroz | 
[& Hobson|2008| [Feroz et al.|2009al ) is a generic implementation 
of nested sampling, extended to handle multimodal and degenerate 
distributions, and is fully parallelised. 
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At each point in parameter space, Bayesian methods require 
the evaluation of a 'likelihood' function describing the probability 
of obtaining the data for a given set of model parameters. For some 
cosmological and particle physics problems each such function 
evaluation takes up to tens of seconds. MCMC applications may 
require millions of these evaluations, making them prohibitively 
costly. MultiNest is able to reduce the number of likelihood 
function calls by an order of magnitude or more, but further gains 
can be achieved if we are able to speed up the evaluation of the like- 
lihood itself. An artificial neural network (NN) is ideally suited for 
this task. A universal approximation theorem assures us that we can 
accurately and precisely approximate the likelihood with a NN of a 
given form. The training of NNs is one of the most widely studied 
problems in machine learning, so techniques for learning the likeli- 
hood function are well established. We implement a variant of con- 
jugate gradient descent to find the optimum set of weights for a NN, 
using regularisation of the likelihood and a Hessian-free second- 
order approximation to improve the quality of proposed steps to- 
wards the best fit. 

The blind accelerated multimodal Bayesian inference 
(BAMBI) algorithm combines these two elements. After a 
specified number of new samples from MultiNest have been 
obtained, BAMBI uses these to train a network on the likelihood 
function. After convergence to the optimal weights, we test that the 
network is able to predict likelihood values to within a specified 
tolerance level. If not, sampling continues using the original likeli- 
hood until enough new samples have been made for training to be 
done again. Once a network is trained that is sufficiently accurate. 
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its predictions are used in place of the original likelihood function 
for future samples for MultiNest. Using the network reduces the 
likelihood evaluation time from seconds to milliseconds, allowing 
MultiNest to complete the analysis much more rapidly. As a 
bonus, the user also obtains a network that is trained to easily 
and quickly provide more likelihood evaluations near the peak if 
needed, or in subsequent analyses. 

The structure of the paper is as follows. In Section [2] we will 
introduce Bayesian inference and the use of nested sampling. Sec- 
tion|3]will then explain the structure of a NN and how our optimiser 
works to find the best set of weights. We present some toy examples 
with BAMBI in Section |4]to demonstrate its capabilities; we apply 
BAMBI to cosmological parameter estimation in Section[5] In Sec- 
tion |6] we show the full potential speed-up from BAMBI, by using 
the trained NNs in a follow-up analysis. Section |7] summarises our 
work and presents our conclusions. 





(a) 



Figure 1. Cartoon illustrating (a) the posterior of a two dimensional prob- 
lem; and (b) the transformed C{X) function where the prior volumes Xi 
are associated with each likelihood Ci. Originally published in Feroz & 
Hobson (2008). 



2 BAYESIAN INFERENCE AND MULTINEST 
2.1 Theory of Bayesian Inference 

Bayesian statistical methods provide a consistent way of estimating 
the probability distribution of a set of parameters for a given 
model or hypothesis H given a data set D. B ayes' theorem states 
that 

where Pr(0|D, i^) is the posterior probability distribution of the 
parameters and is written as P(0), Pr(D|0, H) is the likelihood 
and is written as £(0), Pr(0|if) is the prior distribution and is 
written as 7r(0), and Pr(D|if) is the Bayesian evidence and is 
written as Z. The evidence is the factor required to normalise the 
posterior over 0, 

j £(0)7r(0)d^0, (2) 

where N is the dimensionality of the parameter space. Since the 
Bayesian evidence is independent of the parameter values, 0, it 
can be ignored in parameter estimation problems and the posterior 
inferences obtained by exploring the un-normalized posterior. 

Bayesian parameter estimation has achieved widespread use 
in many astrophysical applications. Standard Monte Carlo methods 
such as the Metropolis-Hastings algorithm or Hamiltonian sam- 
pling (see MacKay 2003 ) can experience problems with multi- 
modal likelihood distributions, as they can get stuck in a single 
mode. Additionally, long and curving degeneracies are difficult for 
them to explore and can greatly reduce sampling efficiency. These 
methods often require careful tuning of proposal jump distributions 
and testing for convergence can be problematic. Additionally, cal- 
culation of the evidence for model selection often requires running 
multiple chains, greatly increasing the computational cost. Nested 
sampling and the MULTINEST algorithm implementation address 
these problems. 



2.2 Nested Sampling 

Nested sampling ( Skilling'2004^ is a Monte Carlo method used for 
the computation of the evidence that can also provide posterior in- 
ferences. It transforms the multi-dimensional integral of Equation[2] 



into a one-dimensional integral over the prior volume. This is done 
by defining the prior volume X as dX = 7r(0)d^0. Therefore, 

X{\) = / 7r(0)d^0. (3) 

This integral extends over the region of parameter space contained 
within the likelihood contour £(0) = A. The evidence integral. 
Equation [2] can then be written as 

Z= f C{X)dX, (4) 

where C{X) is the inverse of Equation [5] and is a monotonically 
decreasing function of X. Thus, if we evaluate the likelihoods d — 
C{Xi), where Xi is a sequence of decreasing values, 

< Xm < ■ ■ ■ < X2 < Xi < Xo = 1. (5) 

The evidence can then be approximated numerically as a weighted 
sum 

M 

Z - ^^CiWi, (6) 

where the weights Wi for the simple trapezium rule are given by 
Wi = ^{Xi-i — Xi-^i). An example of a posterior in two dimen- 
sions and its associated function C{X) is shown in Figure[T] 

The fundamental operation of nested sampling begins with the 
initial, 'live', points being chosen at random from the entire prior 
volume. The lowest likelihood live point is removed and replaced 
by a new sample with higher likelihood. This removal and replace- 
ment of live points continues until a stopping condition is reached 
(MultiNest uses a tolerance on the evidence calculation). The 
difficult task lies in finding a new sample with higher likelihood 
than the discarded point. As the algorithm goes up in likelihood, the 
prior volume that will satisfy this condition decreases until it con- 
tains only a very small portion of the total parameter space, making 
this sampling potentially very inefficient. MultiNest tackles this 
problem by enclosing all of the active points in clusters of ellip- 
soids. New points can then be chosen from within these ellipsoids 
using a fast analytic function. Since the ellipsoids will decrease in 
size along with the distribution of live points, their surfaces in ef- 
fect represent likelihood contours of increasing value; the algorithm 
climbs up these contours seeking new points. As the clusters of el- 
lipsoids are not constrained to fit any particular distribution, they 
can easily enclose curving degeneracies and are able to separate 
out to allow for multimodal distributions. This separation also al- 
lows for the calculation of the 'local' evidence associated with each 
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Figure 2. A 3 -layer neural network with 3 inputs, 4 hidden nodes, and 2 
outputs. Image courtesy of Wikimedia Commons. 



mode. MultiNest has been shown to be of substantial use in as- 
trophysics and particle p hysics (see|Feroz et al.|2008a|b| |2009b|c| 
|20T0l|Trotta et al.|2008[|Gair et al.|201 0), typically showing great 
improvement in efficiency over traditional MCMC techniques. 



3 ARTIFICIAL NEURAL NETWORKS AND WEIGHTS 
OPTIMISATION 

Artificial neural networks are a method of computation loosely 
based on the structure of a brain. They consist of a group of inter- 
connected nodes, which process information that they receive and 
then pass this along to other nodes via weighted connections. We 
will consider only feed-forward NN, for which the structure is di- 
rected, with a layer of input nodes passing information to an output 
layer, via zero, one, or many hidden layers in between. A network is 
able 'learn' a relationship between inputs and outputs given a set of 
training data and can then make predictions of the outputs for new 
input data. Further introduction can be found in (MacKayl ( |2003| ). 

3.1 Network Structure 

A multilayer perceptron artificial neural network (NN) is the sim- 
plest type of network and consists of ordered layers of perceptron 
nodes that pass scalar values from one layer to the next. The percep- 
tron is the simplest kind of node, and maps an input vector x G 3^^^ 
to a scalar output /(x; w, ^) via 

n 

f{x;w,e) = e + Y,WiXi, (7) 

where {wi} and are the parameters of the perceptron, called the 
'weights' and 'bias', respectively. We will focus mainly on 3-layer 
NNs, which consist of an input layer, a hidden layer, and an output 
layer as shown in Figure [2] The outputs of the nodes in the hidden 
and output layers are given by the following equations: 

hidden layer: hj = /f ' = 6^^ + J2 ^J'^;,, (8) 

I 

output layer: = g<-'\ft^); /f ' = > +J2'^\fh,, (9) 

3 

where / runs over input nodes, j runs over hidden nodes, and i runs 
over output nodes. The functions and are called activation 



functions and must be bounded, smooth, and monotonic for our 
purposes. We use g^'^\x) — tanh(x) and g^'^\x) — x; the non- 
linearity of is essential to allowing the network to model non- 
linear functions. 

The weights and biases are the values we wish to determine in 
our training (described in Section [33] l. As they vary, a huge range 
of non-linear mappings from inputs to outputs is possible. In fact, 
a 'universal approximation theorem' (see |Hornik, Stinchcombe &] 
|White|p^90 t states that a NN with three or more layers can ap- 
proximate any continuous function as long as the activation func- 
tion is locally bounded, piecewise continuous, and not a polynomial 
(hence our use of tanh, although other functions would work just 
as well, such as a sigmoid). By increasing the number of hidden 
nodes, we can achieve more accuracy at the risk of overfitting to 
our training data. 

As long as the mapping from model parameters to predicted 
data is continuous - and it is in many cases - the likelihood func- 
tion will also be continuous. This makes a NN an ideal tool for 
approximating the likelihood. 



3.2 Choosing the Number of Hidden Layer Nodes 

An important choice when using a NN is the number of hidden 
layer nodes to use. We consider here just the case of three-layer 
networks with only one hidden layer. The optimal number is a com- 
plex relationship between of the number of training data points, the 
number of inputs and outputs, and the complexity of the function to 
be trained. Choosing too few nodes will mean that the NN is unable 
to learn the likelihood function to sufficient accuracy. Choosing too 
many will increase the risk of overfitting to the training data and 
will also slow down the training process. As a general rule, a NN 
should not need more hidden nodes than the number of training data 
points. We choose to use 50 or 100 nodes in the hidden layer in our 
examples. These choices allow the network to model the complex- 
ity of the likelihood surface and its functional dependency on the 
input parameters. The toy examples (Section |4| have fewer inputs 
that result in more complicated surfaces, but with simple functional 
relationships. The cosmological examples (Section |5]) have more 
inputs with complex relationships to generate a simple likelihood 
surface. In practice, it will be better to over-estimate the number of 
hidden nodes required. There are checks built in to prevent over- 
fitting and for computationally expensive likelihoods the additional 
training time will not be a large penalty if a usable network can be 
obtained earlier in the analysis. 



3.3 Network Training 

We wish to use a NN to model the likelihood function given some 
model and associated data. The input nodes are the model param- 
eters and the single output is the value of the likelihood function 
at that point. The set of training data, V = {x*^^^ , t*^^^ }, is the last 
updint number of points accepted by MultiNest, a parameter 
that is set by the user. This data is split into two groups randomly; 
approximately 80% is used for training the network and 20% is 
used as validation data to avoid overfitting. If training does not 
produce a sufficiently accurate network, MultiNest will obtain 
updInt/2 new samples before attempting to train again. Addi- 
tionally, the range of log-likelihood values must be within a user- 
specified range or NN training will be postponed. 
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3.3.1 Overview 

The weights and biases we will collectively call the network pa- 
rameter vector a. We can now consider the probability that a given 
set of network parameters is able to reproduce the known training 
data outputs - representing how well our NN model of the original 
likelihood reproduces the true values. This gives us a log-likelihood 
function for a, depending on a standard error function, given by 



log(£(a;a)) 



ii'log(27r) 



log((T) 



(10) 



where K is the number of data points and y{yS^^]ai) is the NN's 
predicted output value for the inputs x*^'^^ and network parameters 
a. The value of a is a hyper-parameter of the model that describes 
the standard deviation (error size) of the output. Our algorithm con- 
siders the parameters a to be probabilistic with a log-prior distribu- 
tion given by 



log(<S(a;a)) 



(11) 



a is a hyper-parameter of the model, called the 'regularisation con- 
stant', that gives the relative influence of the prior and the likeli- 
hood. The posterior probability of a set of NN parameters is thus 



Pr(a; a, a) oc £(a; a) x <S(a; a) 



(12) 



BAMBFs network training begins by setting random values 
for the weights, sampled from a normal distribution with zero 
mean. The initial value of a is set by the user and can be set on ei- 
ther the true log-likelihood values themselves or on their whitened 
values (whitening involves performing a linear transform such that 
the training data values have a mean of zero and standard devia- 
tion of one). The only difference between these two settings is the 
magnitude of the error used. The algorithm then calculates a large 
initial estimate for a. 



|Vlog(£)| 



(13) 



where M is the total number of weights and biases (NN parame- 
ters) and r is a rate set by the user (0 < r ^ 1, default r = 0.1) 
that defines the size of the 'confidence region' for the gradient. This 
formula for a sets larger regularisation ('damping') when the mag- 
nitude of the gradient of the likelihood is larger. This relates the 
amount of "smoothing" required to the steepness of the function 
being smoothed. The rate factor in the denominator allows us to 
increase the damping for smaller confidence regions on the value 
of the gradient. This results in smaller, more conservative steps that 
are more likely to result in an increase in the function value. 

BAMBI then uses conjugate gradients to calculate a step, Aa, 
that should be taken (see Section [3.3.21 ). Following a step, adjust- 
ments to a and a may be made before another step is calculated. 
The methods for calculating the initial a value and then determin- 
ing subsequent adjustments of a and/or a are as developed for the 
MemSys software package, described in |Gull & Skilling|fT999| ). 



information. Newton's method gives the second-order approxima- 
tion of a function, 

/(a + Aa) « /(a) + (V/(a))^Aa + ^ (Aa)^BAa, (14) 

where B is the Hessian matrix of second derivatives of / at a. In 
this approximation, the maximum of / will occur when 

V/(a + Aa) ^ V/(a) + BAa = 0. (15) 

Solving this for Aa gives us 

Aa= -B"V/(a). (16) 

Iterating this procedure will bring us eventually to the global max- 
imum value of /. For our purposes, the function / is the log- 
posterior distribution and hence Equation |T4] ) is a Gaussian ap- 
proximation to the posterior. The Hessian of the log-posterior is 
the regularised ('damped') Hessian of the log-likelihood function, 
where the prior - whose magnitude is set by a - provides the reg- 
ularisation. If we define the Hessian matrix of the log-likelihood as 
H, then B = H + al (I being the identity matrix). 

Using the second-order information provided by the Hessian 
allows for more efficient steps to be made, since curvature infor- 
mation can extend step sizes in directions where the gradient varies 
less and shorten where it is varying more. Additionally, using the 
Hessian of the log-posterior instead of the log-likelihood adds the 
regularisation of the prior, which can help to prevent getting stuck 
in a local maximum by smoothing out the function being explored. 
It also aids in reducing the 'region of confidence' for the gradient 
information which will make it less likely that a step results in a 
worse set of parameters. 

Given the form of the log-likelihood. Equation |T0] ), is a sum 
of squares (plus a constant), we can also save computational ex- 
pense by utilising the Gauss-Newton approximation of its Hessian, 
given by 



E 

fc=i 

K 



dai daj daidaj 



dai daj 



(17) 



where 



Vk 



(18) 



The drawback of using second-order information is that cal- 
culation of the Hessian is computationally expensive and requires 
large storage, especially so in many dimensions as we will en- 
counter for more complex networks. In general, the Hessian is 
not guaranteed to be positive semi-definite and so may not be in- 
vertible; however, the Gauss-Netwon approximation does have this 
guarantee. Inversion of the very large matrix will still be computa- 
tionally expensive. As noted in Martens] ( [2010 ) however, we only 
need products of the Hessian with a vector to compute the solution, 
never actually the full Hessian itself. To calculate these approxi- 
mate Hessian- vector products, we use a fast approximate method 
given in |Schraudolph| p002| ) and [Pearlmutter] ( |1994 l Combining 
all of these methods makes second-order information practical to 



3.3.2 Finding the next step 

In order to find the most efficient path to an optimal set of parame- 
ters, we perform conjugate gradients using second-order derivative 



3.3.3 Convergence 

Following each step, the posterior, likelihood, and correlation val- 
ues are calculated for the training data and the validation data that 



© 2011 RAS, MNRAS 000,[l]{T2| 



BAMBI 5 



was not used in training (calculating the steps). Convergence to a 
best-fit set of parameters is determined by maximising the poste- 
rior, likelihood, or correlation of the validation data, as chosen by 
the user. This prevents overfitting as it provides a check that the 
network is still valid on points not in the training set. We use the 
correlation as the default function to maximise as it does not in- 
clude the model hyper-parameters in its calculation. 



3.4 When to Use the Trained Network 

The optimal network possible with a given set of training data may 
not be able to predict likelihood values accurately enough, so an ad- 
ditional criterion is placed on when to use the trained network. This 
requirement is that the standard deviation of the difference between 
predicted and true log-likelihood values is less than a user-specified 
tolerance. When the trained network does not pass this test, then 
BAMBI will continue using the original log-likelihood function to 
obtain updInt/2 new samples to generate a new training data set 
of the last updint accepted samples. Network training will then 
resume, beginning with the weights that it had found as optimal for 
the previous data set. Since samples are generated from nested con- 
tours and each new data set contains half of the previous one, the 
saved network will already be able to produce reasonable predic- 
tions on this new data; resuming therefore enables us to save time 
as fewer steps will be required to reach the new optimum weights. 

Once a NN is in use in place of the original log-likelihood 
function, checks are made to ensure that the network is main- 
taining its accuracy. If the network makes a prediction outside of 
[min(training) — a, max(training) + a], then that value is discarded 
and the original log-likelihood function is used for that point. Ad- 
ditionally, the central 95* percentile of the output log-likelihood 
values from the training data used is calculated and if the network 
is making predictions mostly outside of this range then it will be re- 
trained. To re-train the network, BAMBI first substitutes the origi- 
nal log-likelihood function back in and gathers the required number 
of new samples from MultiNest. Training then commences, re- 
suming from the previously saved network. These criteria ensure 
that the network is not trusted too much when making predictions 
beyond the limits of the data it was trained on, as we cannot be sure 
that such predictions are accurate. 



4 BAMBI TOY EXAMPLES 

In order to demonstrate the ability of BAMBI to learn and accu- 
rately explore multimodal and degenerate likelihood surfaces, we 
first tested the algorithm on a few toy examples. The eggbox like- 
lihood has many separate peaks of equal likelihood, meaning that 
the network must be able to make predictions across many differ- 
ent areas of the prior. The Gaussian shells likelihood presents the 
problem of making predictions in a very narrow and curving region. 
Lastly, the Rosenbrock function gives a long, curving degeneracy 
that can also be extended to higher dimensions. They all require 
high accuracy and precision in order to reproduce the posterior 
truthfully and each presents unique challenges to the NN in learn- 
ing the likelihood. It is important to note that running BAMBI on 
these problems required more time than the straightforward anal- 
ysis; this was as expected since the actual likelihood functions are 
simple analytic functions that do not require much computational 
expense. 




Figure 3. The eggbox log-likelihood surface, given by Equation {19). 

Method log(^) 

Analytical 235.88 
MultiNest 235.859 ± 0.039 
BAMBI 235.901 ± 0.039 

Table 1. The log-evidence values of the eggbox likelihood as found analyt- 
ically and with MultiNest and BAMBI. 

4.1 Eggbox 

This is a standard example of a very multimodal likelihood distri- 
bution in two dimensions. It has many peaks of equal value, so the 
network must be able to take samples from separated regions of 
the prior and make accurate predictions in all peaks. The eggbox 
likelihood ( [Feroz et al.|2009a| ) is given by 

=exp [(2 + cos(f)cos(f))'] , (19) 

where we take a uniform prior Z//(0, IOtt) for both x and y. The 
structure of the surface can be seen in Figure [3] 

We ran the eggbox example in both MultiNest and BAMBI, 
both using 4000 live points. For BAMBI, we used 4000 samples 
for training a network with 50 hidden nodes. In Table [T] we report 
the evidences recovered by both methods as well as the true value 
obtained analytically from Equation ([19]). Both methods return ev- 
idences that agree with the analytically determined value to within 
the given error bounds. Figure |4] compares the posterior probabil- 
ity distributions returned by the two algorithms via the distribu- 
tion of lowest-likelihood points removed at successive iterations by 
MultiNest. We can see that they are identical distributions; there- 
fore, we can say that the use of the NN did not reduce the quality 
of the results either for parameter estimation or model selection. 
During the BAMBI analysis 51.3% of the log-likelihood function 
evaluations were done using the NN; if this were a more compu- 
tationally expensive function, significant speed gains would have 
been realised. 

4.2 Gaussian Shells 

The Gaussian shells likelihood function has low values over most 
of the prior, except for thin circular shells that have Gaussian cross- 
sections. We use two separate Gaussian shells of equal magnitude 
so that this is also a mutlimodal inference problem. Therefore, our 



© 2011 RAS, MNRAS 000,[l]{T2] 



6 Grajf, Feroz, Hobson, and Lasenby 





(a) 



(b) 



Figure 4. Points of lowest likelihood of the eggbox likelihood from succes- 
sive iterations as given by (a) MultiNest and (b) BAMBI. 



Method 



log(^) 



Analytical 
MultiNest 
BAMBI 



-1.75 
-1.768 ±0.052 
-1.757 ±0.052 



Table 2. The log-evidence values of the Gaussian shell likelihood as found 
analytically and with MultiNest and BAMBI. 




(a) 



(b) 



Figure 5. The Gaussian shell likelihood surface, given by Equations j20} 
and (21). 



Gaussian shells likelihood is 

i2(x) = circ(x; ci, n, il^i) + circ(x; C2, r2, 



where each shell is defined by 
1 



circ(x; c, r, w) 



: exp 



(|x- 



(20) 



(21) 



This is shown in Figure [5] 

As with the eggbox problem, we analysed the Gaussian shells 
likelihood with both MultiNest and BAMBI using uniform pri- 
ors Z//(— 6, 6) on both dimensions of x. MultiNest sampled with 
1000 live points and BAMBI used 2000 samples for training a net- 
work with 100 hidden nodes. In Table|2]we report the evidences re- 
covered by both methods as well as the true value obtained analyti- 
cally from Equations ( |20] ) and ( [2T] ). The evidences are both consis- 
tent with the true value. Figure|6]compares the posterior probability 
distributions returned by the two algorithms (in the same manner as 
with the eggbox example). Again, we see that the distribution of re- 
turned values is nearly identical when using the NN, which BAMBI 
used for 18.2% of its log-likelihood function evaluations. This is a 
significant fraction, especially since they are all at the end of the 
analysis when exploring the peaks of the distribution. 

4.3 Rosenbrock function 

The Rosenbrock function is a standard example used for testing op- 
timization as it presents a long, curved degeneracy through all di- 
mensions. For our NN training, it presents the difficulty of learning 
the likelihood function over a large, curving region of the prior. We 



Figure 6. Points of lowest likelihood of the Gaussian shell likelihood from 
successive iterations as given by (a) MultiNest and (b) BAMBI. 



use the Rosenbrock function to define the negative log-likelihood, 
so the likelihood function is given in N dimensions by 



£(x) = exp i - ^ [(1 - x^f + 100(x^+i 



2n21 



(22) 



Figure [7] shows how this appears for = 2. 

We set uniform priors of ^(—5, 5) in all dimensions and per- 
formed analysis with both MultiNest and BAMBI with N = 2 
and N = 10. For AT = 2, MultiNest sampled with 2000 live 
points and BAMBI used 2000 samples for training a NN with 50 
hidden-layer nodes. With N = 10, we used 2000 live points, 6000 
samples for network training, and 50 hidden nodes. Table |3] gives 
the calculated evidences values returned by both algorithms as well 
as the analytically calculated values from Equation ( [22] ) (there does 
not exist an analytical solution for the lOD case so this is not in- 
cluded). Figure [8] compares the posterior probability distributions 
returned by the two algorithms for N — 2. For N — 10, we show 
in Figure |9] comparisons of the marginalised two-dimensional pos- 
terior distributions for 12 variable pairs. We see that MultiNest 
and BAMBI return nearly identical posterior distributions as well 




Figure 7. The Rosenbrock log-likelihood surface given by Equation j22J 
with = 2. 
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Method log(2) 



Analytical 2D 


-5.804 


MultiNest 2D 


-5.799 ±0.049 


BAMBI 2D 


-5.757 ±0.049 


MultiNest lOD 


-41.54 ±0.13 


BAMBI lOD 


-41.53 ±0.13 



Table 3. The log-evidence values of the Rosenbrock likelihood as found 
analytically and with MultiNest and BAMBI. 



Parameter 


Min 


Max 




0.018 


0.032 




0.04 


0.16 


9 


0.98 


1.1 


T 


0.01 


0.5 




-0.1 


0.1 


ris 


0.8 


1.2 


log[10iOA,] 


2.7 


4 







2 



Table 4. The cosmological parameters and their minimum and maximum 
values. Uniform priors were used on all variables. Qk was set to for the 
flat model. 




(a) (b) 

Figure 8. Points of lowest likelihood of the Rosenbrock likelihood for = 
2 from successive iterations as given by (a) MultiNest and (b) BAMBI. 



as consistent estimates of the evidence. For N — 2 and = 10, 
BAMBI was able to use a NN for 64.7% and 30.5% of its log- 
likelihood evaluations respectively. Even when factoring in time 
required to train the NN this would have resulted in large decreases 
in running time for a more computationally expensive likelihood 
function. 



5 COSMOLOGICAL PARAMETER ESTIMATION WITH 
BAMBI 

While likelihood functions resembling our previous toy examples 
do exist in real physical models, we would also like to demonstrate 
the usefulness of BAMBI on simpler likelihood surfaces where the 
time of evaluation is the critical limiting factor. One such example 
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Figure 9. Marginalised 2D posteriors for the Rosenbrock function with 
= 10. The 12 most correlated pairs are shown. MultiNest is in solid 
black, BAMBI in dashed blue. Inner and outer contours represent 68% and 
95% confidence levels, respectively. 



in astrophysics is that of cosmological parameter estimation and 
model selection. 

We implement BAMBI within the standard COSMOMC 
code ( [Lewis & Bridle|2002| , which by default uses MCMC sam- 
pling. This allows us to compare the performance of BAMBI with 
other methods, such as MultiNest ( ,Feroz et al.pOOQaj , Cos- 
moNet ( |Auld et al.|[2007l [2008]), In terpMC ( [Bouland et ah] 
|20TT] ), PICO fFendt & W andelt||200 6^, and others. In this paper, 
we will only report performances of BAMBI and MultiNest, but 
these can be compared with reported performance from the other 
methods. 

Bayesian parameter estimation in cosmology requires evalua- 
tion of theoretical temperature and polarisation CMB power spec- 
tra (Ci values) using codes such as CAMB (Lewis et al. 2000). 
These evaluations can take on the order of tens of seconds depend- 
ing on the cosmological model. The Ci spectra are then compared 
to WMAP and other observations for the likelihood function. Con- 
sidering that thousands of these evaluations will be required, this is 
a computationally expensive step and a limiting factor in the speed 
of any Bayesian analysis. With BAMBI, we use samples to train 
a NN on the combined likelihood function which will allow us to 
forgo evaluating the full power spectra. This has the benefit of not 
requiring a pre-computed sample of points as in CosmoNet and 
PICO, which is particularly important when including new param- 
eters or new physics in the model. In these cases we will not know 
in advance where the peak of the likelihood will be and it is around 
this location that the most samples would be needed for accurate 
results. 

The set of cosmological parameters that we use as variables 
and their prior ranges is given in T able [4} the paramet ers have their 
usual meanings in cosmology (see |Larson et al.|2 0ir table 1). The 
prior probability distributions are uniform over the ranges given. A 
non-flat cosmological model incorporates all of these parameters, 
while we set Qk — for a flat model. We use w — —1 in both 
cases. The flat model thus represents the standard ACDM cosmol- 
ogy. We use two different data sets for analysis: (1) CMB obser- 
vations alone and (2) CMB observations plus Hubble Space Tele- 
scope constraints on Ho, large-scale structure constraints from the 
luminous red galaxy subset of the SDSS and the 2dF survey, and 
supernovae la data. The CMB dataset consists of WMAP seven- 
year data (Larson et al."20ir) and higher resolution observations 
from the ACBAR, CBI, and BOOMERanG experiments. The COS- 
MOMC website (see |Lewis & Bridle|2002 ) provides full references 
for the most recent sources of these data. 

Analyses with MultiNest and BAMBI were run on all four 
combinations of models and data sets. MultiNest sampled with 
1000 live points and an efficiency of 0.5, both on its own and within 
BAMBI; BAMBI used 2000 samples for training a NN on the like- 
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Algorithm 


Model 


Data Set 


log(^) 


MultiNest 


ACDM 


CMB only 


-3754.58 ± 0.12 


BAMBI 


ACDM 


CMB only 


-3754.57 ± 0.12 


MultiNest 


ACDM 


all 


-4124.40 ±0.12 


BAMBI 


ACDM 


all 


-4124.11 ±0.12 


MultiNest 




CMB only 


-3755.26 ±0.12 


BAMBI 


ACDM+17k 


CMB only 


-3755.57 ±0.12 


MultiNest 




all 


-4126.54 ±0.13 


BAMBI 


ACDM+f^K 


all 


-4126.35 ±0.13 



Table 5. Evidences calculated by Multinest and BAMBI for the flat 
(ACDM) and non-flat {KCDM+^k) models using the CMB-only and 
complete data sets. The two algorithms are in close agreement in all cases. 




0.02 0.022 2 0.024 0.026 0.08 0.1 ^ 




1.03 1.035 1.04 1.045 1.05 1.055 0.05 0.1 0.15 0.2 

e T 




-0.1 -0.05 0.05 0.9 0.95 1 1.05 1.1 




2.95 3 3.05 ,.3.1 3.15 3.2 0.5 1 1.5 2 

log(lo'X) 



Figure 10. Marginalised ID posteriors for the non-flat model 
{ACDM+VLk) using only CMB data. MultiNest is in solid black, 
BAMBI in dashed blue. 



lihood, with 50 hidden-layer nodes for both the flat model and non- 
flat model. Table [5] reports the recovered evidences from the two 
algorithms for both models and both data sets. It can be seen that 
the two algorithms report equivalent values to within statistical er- 
ror for all four combinations. In Figures [T0| and [TT] we show the 
recovered one- and two-dimensional marginalised posterior proba- 
bility distributions for the non-flat model using the CMB-only data 
set. Figures [12] and [13] show the same for the non-flat model us- 
ing the complete data set. We see very close agreement between 
MultiNest (in solid black) and BAMBI (in dashed blue) across 
all parameters. The only exception is Asz since it is unconstrained 
by these models and data and is thus subject to large amounts of 
variation in sampling. The posterior probability distributions for 
the flat model with either data set are extremely similar to those of 
the non-flat flodel with setting Q.k = 0, as expected, so we do not 
show them here. 

A by-product of running BAMBI is that we now have a net- 
work that is trained to predict likelihood values near the peak of 
the distribution. To see how accurate this network is, in Figure [T4] 
we plot the error in the prediction (A log(£) = log (/^predicted) — 
log(£true)) versus the true log-likelihood value for the different sets 
of training and validation points that were used. What we show are 
results for networks that were trained to sufficient accuracy in order 
to be used for making likelihood predictions; this results in two net- 
works for each case shown. We can see that although the flat model 
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Figure 11. Marginalised 2D posteriors for the non-flat model 
(ACDM±r2x) using only CMB data. The 12 most correlated pairs 
are shown. MultiNest is in solid black, BAMBI in dashed blue. Inner 
and outer contours represent 68% and 95% confidence levels, respectively. 
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Figure 12. Marginalised ID posteriors for the non-flat model 
{ACTM+VLk) using the complete data set. MultiNest is in soHd 
black, BAMBI in dashed blue. 



used the same number of hidden-layer nodes, the simpler physical 
model allowed for smaller error in the likelihood predictions. Both 
final networks (one for each model) are significantly more accu- 
rate than the specified tolerance of a maximum standard deviation 
on the error of 0.5. In fact, for the flat model, all but one of the 
2000 points have an error of less than 0.06 log-units. The two net- 
works trained in each case overlap in the range of log-likelihood 
values on which they trained. The first network, although trained to 
lower accuracy, is valid over a much larger range of log-likelihoods. 
The accuracy of each network increases with increasing true log- 
likelihood and the second network, trained on higher log-likelihood 
values, is significantly more accurate than the first. 

The final comparison, and perhaps the most important, is the 
running time required. The analyses were run using MPI paral- 
lelisation on 48 processors. We recorded the time required for the 
complete analysis, not including any data initialisation prior to ini- 
tial sampling. We then divide this time by the number of likeli- 
hood evaluations performed to obtain an average time per likeli- 
hood (twaii clock, sec X A^cpus/iViog(/:) evais)- Therefore, time required 
to train the NN is still counted as a penalty factor. If a NN takes 
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0.022 0.023 0.024 0.10S.H).11S.12.125 1.035 1.04 1.045 
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Figure 13. Marginalised 2D posteriors for the non-flat model 
(ACDM+r^x) using the complete data set. The 12 most correlated 
pairs are shown. MultiNest is in solid black, BAMBI in dashed blue. 
Inner and outer contours represent 68% and 95% confidence levels, 
respectively. 



Model 


Data set 


MultiNest 


BAMBI 


Speed 






tc (s) 


tc (s) 


factor 


ACDM 


CMB only 


2.394 


1.902 


1.26 


ACDM 


all 


3.323 


2.472 


1.34 


ACDM+17k 


CMB only 


12.744 


9.006 


1.42 


ACDM+17k 


all 


12.629 


10.651 


1.19 



Table 6. Time per likelihood evaluation, factor of speed increase from 
MultiNest to BAMBI (tMN/^BAMBi). 



Model Data set % log(£) Equivalent Actual 

with NN speed factor speed factor 



ACDM 
ACDM 
ACDM+17k 
ACDM+^7K 



CMB only 
all 

CMB only 
all 



40.5 
40.2 
34.2 
30.0 



1.68 
1.67 
1.52 
1.43 



1.26 
1.34 
1.42 
1.19 



Table 7. Percentage of likelihood evaluations performed with a NN, equiv- 
alent speed factor, and actual factor of speed increase. 



-4180 -4160 -4140 -4120 
log(L) 



-4111 -4110-4109-4108 -4107-4106 -4105 
log(L) 



-4150 -4140 -4130 -4120 -4110 
log(L) 



mo -4109 -4108 -4107 -4106 -4105 
log(L) 



Figure 14. The error in the predicted likelihood (Alog(£) = 
log (/^predicted) — log(>^true)) for the BAMBI networks trained on the flat 
(top row) and non-flat (bottom row) models using the complete data set. 
The left column represents predictions from the first NN trained to suffi- 
cient accuracy; the right column are results from the second, and final, NN 
trained in each case. The flat and non-flat models both used 50 hidden-layer 
nodes. 



more time to train, this will hurt the average time, but obtaining a 
usable NN sooner and with fewer training calls will give a better 
time since more likelihoods will be evaluated by the NN. The re- 
sulting average times per likelihood and speed increases are given 
in Table [6] Although the speed increases appear modest, one must 
remember that these include time taken to train the NNs, during 
which no likelihoods were evaluated. This can be seen in that al- 
though 30-40% of likelihoods are evaluated with a NN, as reported 
in Table [7] we do not obtain the full equivalent speed increase. We 
are still able to obtain a significant decrease in running time while 
adding in the bonus of having a NN trained on the likelihood func- 
tion. 



6 USING TRAINED NETWORKS FOR FOLLOW-UP IN 
BAMBI 

A major benefit of BAMBI is that following an initial run the user 
is provided with a trained NN, or multiple ones, that model the 
log-likelihood function. These can be used in a subsequent analysis 
with different priors to obtain much faster results. This is a com- 
parable analysis to that of CosmoNet ( Auld et aL||2007| 12008") ), 
except that the NNs here are a product of an initial Bayesian anal- 
ysis where the peak of the distribution was not previously known. 
No prior knowledge of the structure of the likelihood surface was 
used to generate the networks that are now able to be re-used. 

When multiple NNs are trained and used in the initial BAMBI 
analysis, we must determine which network's prediction to use in 
the follow-up. The approximate error of uncertainty of a NN's pre- 
diction of the value a) (x denoting input parameters, a NN 
weights and biases as before) that models the log-likelihood func- 
tion is given by [MacKay] ( | 1 995| ) as 



2 2 
CTpred + <^z/? 



where 



2 

^pred 



g 



(23) 



(24) 



and al is the variance of the noise on the output from the network 
training. In Equation ( [24] ), B is the Hessian of the log-posterior as 
before, and g is the gradient of the NN's prediction with respect to 
the weights about their maximum posterior values. 



aj/(x; a) 

B — |x,aMP- 



(25) 



This uncertainty of the prediction is calculated for each training and 
validation data point used in the initial training of the NN for each 
saved NN. The threshold for accepting a predicted point from a 
network is then set to be 1.2 times the maximum uncertainty found. 

A log-likelihood is calculated by first making a prediction with 
the final NN to be trained and saved and then calculating the er- 
ror for this prediction. If the error, cTpred, is greater than that NN's 
threshold, then we consider the previous trained NN. We again cal- 
culate the predicted log-likelihood value and error to compare with 
its threshold. This continues to the first NN saved until a NN makes 
a sufficiently confident prediction. If no NNs can make a confident 
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Figure 15. Predictions with uncertainty error bars for NNs saved by 
BAMBI when analysing the non-flat model using the complete data set. The 
left-hand side shows predictions with errors for the two NNs on their own 
and the other's validation data sets. Each network's own points are in blue 
-i-s, the other NN's points are in red crosses. Many error bars are too small 
to be seen. The right-hand side, using the same colour and label scheme, 
shows the magnitudes of the error bars from each NN on the predictions. 



enough prediction, then we set \og{C) — — oo. This is justified 
because the NNs are trained to predict values in the highest like- 
lihood regions of parameter space and if a set of parameters lies 
outside their collective region of validity, then it must not be within 
the region of highest likelihoods. 

To demonstrate the speed-up potential of using the NNs, we 
first ran an analysis of the cosmological parameter estimation us- 
ing both models and both data sets. This time, however, we set the 
tolerance of the NNs to 1.0 instead of 0.5, so that they would be 
valid over a larger range of log-likelihoods and pass the accuracy 
criterion sooner. Each analysis produced two trained NNs. We then 
repeated each of the four analyses, but set the prior ranges to be uni- 
form over the region defined by Xinax(iog(/:)) i 4cr, where a is the 
vector of standard deviations of the marginalised one-dimensional 
posterior probabilities. 

In Figure [15] we show predictions from the two trained NNs 
on the two sets of validation data points in the case of the non-flat 
model using the complete data set. In the left-hand column, we can 
see that the first NN trained is able to make reasonable predictions 
on its own validation data as well as on the second set of points, in 
red crosses, from the second NN's validation data. The final NN is 
able to make more precise predictions on its own data set than the 
initial NN, but is unable to make accurate predictions on the first 
NN's data points. The right-hand column shows the error bar sizes 
for each of the points shown. For both NNs, the errors decrease with 
increasing log-likelihood. The final NN has significantly lower un- 
certainty on predictions for its own validation data, which enables 
us to set the threshold for when we can trust its prediction. The 
cases for the other three sets of cosmological models and data sets 
are very similar to this one. This demonstrates the need to use the 
uncertainty error measurement in determining which NN predic- 
tion to use, if any. Always using the final NN would produce poor 
predictions away from the peak and the initial NN does not have 
sufficient precision near the peak to properly measure the best-fit 
cosmological parameters. But by choosing which NN's prediction 
to accept, as we have shown, we can quickly and accurately repro- 
duce the likelihood surface for sampling. Furthermore, if one were 
interested only in performing a re-analysis about the peak, then one 



Figure 16. Marginalised ID posteriors for the non-flat model 
(ACDM+r^x) using the complete data set. BAMBI's initial run is 
in solid black, the follow-up analysis in dashed blue. 




0.1 0.11 .pA2 0.05 0.1 0.95 1 



Figure 17. Marginalised 2D posteriors for the non-flat model 
(ACDM+r^x) using the complete data set. The 12 most correlated 
pairs are shown. BAMBI's initial run is in solid black, the follow-up 
analysis in dashed blue. Inner and outer contours represent 68% and 95% 
confidence levels, respectively. 



could use just the final NN, thereby omitting the calculational over- 
head associated with choosing the appropriate network. 

For this same case, we plot the one- and two-dimensional 
marginalised posterior probabilities in Figures [16] and [17] respec- 
tively. Although the priors do not cover exactly the same ranges, 
we expect very similar posterior distributions since the priors are 
sufficiently wide as to encompass nearly all of the posterior proba- 
bility. We see this very close agreement in all cases. 

Calculating the uncertainty error requires calculating approx- 
imate inverse Hessian-vector products which slow down the pro- 
cess. We sacrifice a large factor of speed increase in order to main- 
tain the robustness of our predictions. Using the same method as 
before, we computed the time per likelihood calculation for the ini- 
tial BAMBI run as well as the follow up; these are compared in 
Table [8] We can see that in addition to the initial speed-up obtained 
with BAMBI, this follow-up analysis obtains an even larger speed- 
up in time per likelihood calculation. This speed-up is especially 
large for the non-flat model, where CAMB takes longer to com- 
pute the CMB spectra. The speed factor also increases when using 
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Model 


Data set 


Initial 


Follow-up 


Speed 






tc (s) 


tc (s) 


factor 


ACDM 


CMB only 


1.635 


0.393 


4.16 


ACDM 


all 


2.356 


0.449 


5.25 


ACDM+17k 


CMB only 


9.520 


0.341 


27.9 




all 


8.640 


0.170 


50.8 



Table 8. Time per likelihood evaluation, factor of speed increase from 
BAMBI's initial run to a follow-up analysis. 



the complete data set, as the original likelihood calculation takes 
longer than for the CMB -only data set; NN predictions take equal 
time regardless of the data set. 

One possible way to avoid the computational cost of com- 
puting error bars on the predictions is that suggested by MacKay| 
(J^95 ). One can take the NN training data and add Gaussian noise 
and train a new NN, using the old weights as a starting point. Per- 
forming many realisations of this will quickly provide multiple 
NNs whose average prediction will be a good fit to the original 
data and whose variance from this mean will measure the error in 
the prediction. This will reduce the time needed to compute an er- 
ror bar since multiple NN predictions are faster than a single in- 
verse Hessian-vector product. Investigation of this technique will 
be explored in a future work. 



7 SUMMARY AND CONCLUSIONS 

We have introduced and demonstrated a new algorithm for 
rapid Bayesian data analysis. The Blind Accelerated Multimodal 
Bayesian Inference algorithm combines the sampling efficiency of 
MultiNest with the predictive power of artificial neural networks 
to reduce significantly the running time for computationally expen- 
sive problems. 

The first applications we demonstrated are toy examples that 
demonstrate the ability of the NN to learn complicated likelihood 
surfaces and produce accurate evidences and posterior probability 
distributions. The eggbox, Gaussian shells, and Rosenbrock func- 
tions each present difficulties for Monte Carlo sampling as well as 
for the training of a NN. With the use of enough hidden-layer nodes 
and training points, we have demonstrated that a NN can learn to 
accurately predict log-likelihood function values. 

We then apply BAMBI to the problem of cosmological pa- 
rameter estimation and model selection. We performed this using 
flat and non-flat cosmological models and incorporating only CMB 
data and using a more extensive data set. In all cases, the NN is able 
to learn the likelihood function to sufficient accuracy after train- 
ing on early nested samples and then predict values thereafter. By 
calculating a significant fraction of the likelihood values with the 
NN instead of the full function, we are able to reduce the running 
time by a factor of 1.19 to 1.42. This is in comparison to use of 
MultiNest only, which already provides significant speed-ups in 
comparison to traditional MCMC methods (see |Feroz et al.|20()9a| . 

Through all of these examples we have shown the capability 
of BAMBI to increase the speed at which Bayesian inference can 
be done. This is a fully general method and one need only change 
the settings for MultiNest and the network training in order to 
apply it to different likelihood functions. For computationally ex- 
pensive likelihood functions, the network training takes less time 
than is required to sample enough training points and sampling a 
point using the network is extremely rapid as it is a simple analytic 



function. Therefore, the main computational expense of BAMBI is 
calculating training points while the sampling evolves until the net- 
work is able to reproduce the likelihood accurately enough. With 
the trained NN, we can now perform additional analyses using 
the same likelihood function but different priors and save large 
amounts of time in sampling points with the original likelihood and 
in training a NN. Follow-up analyses using already trained NNs 
provides much larger speed increases, with factors of 4 to 50 ob- 
tained for cosmological parameter estimation relative to BAMBI 
speeds. The limiting factor in these runs is the calculation of the 
error of predictions, which is a flat cost based on the size of the NN 
and data set, regardless of the original likelihood function. 

The NNs trained by BAMBI for cosmology cover a larger 
range of log-likelihoods than the one trained for CosmoNet. This 
allows us to use a wider range of priors for subsequent analysis and 
not be limited to the four-sigma region around the maximum like- 
lihood point. By setting the tolerance for BAMBI's NNs to a larger 
value, fewer NNs with larger likelihood ranges can be trained, al- 
beit with larger errors on the predictions. Allowing for larger priors 
requires us to test the validity of our NNs' approximations, which 
ends up slowing the overall analysis. 

Since BAMBI uses a NN to calculate the likelihood at later 
times in the analysis where we typically also suffer from lower 
sampling efficiency (harder to find a new point with higher like- 
lihood than most recent point removed), we are more easily able to 
implement Hamiltonian Monte Carlo ( |Betancoiirt||2010| ) for find- 
ing a proposed sample. This method uses gradient information to 
make better proposals for the next point that should be 'sampled'. 
Calculating the gradient is usually a difficult task, but with the NN 
approximation they are very fast and simple. This improvement will 
be investigated in future work. 

As larger data sets and more complicated models are used in 
cosmology, particle physics, and other fields, the computational 
cost of Bayesian inference will increase. The BAMBI algorithm 
can, without any pre-processing, significantly reduce the required 
running time for these inference problems. In addition to provid- 
ing accurate posterior probability distributions and evidence calcu- 
lations, the user also obtains a NN trained to produce likelihood 
values near the peak(s) of the distribution that can be used in even 
more rapid follow-up analysis. 
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